Research on short-term photovoltaic power generation forecasting model based on multi-strategy improved squirrel search algorithm and support vector machine

Solar photovoltaic (PV) power generation is susceptible to environmental factors, and redundant features can degrade prediction accuracy. To achieve rapid and accurate online prediction, we propose a method that combines Principal Component Analysis (PCA) with a Support Vector Machine optimized by a multi-strategy improved Squirrel Search Algorithm (MISSA-SVM). First, to mitigate the impact of redundant features on prediction accuracy, PCA is employed for feature dimensionality reduction. Second, SVM is adopted as the base algorithm for constructing the prediction model. Third, to address the influence of hyperparameter selection on model performance, the Squirrel Search Algorithm (SSA) is introduced to optimize the SVM hyperparameters, with the aim of establishing the optimal prediction model. Fourth, to enhance solution efficiency and accuracy, a multi-strategy improved SSA (MISSA) is proposed, which refines SSA by integrating Population Initialization based on the Tent map, Nonlinear Predator Presence Probability, Chaotic-based Dynamic Opposition-based Learning, and a Selection Strategy. Finally, in case studies, the optimization performance of MISSA is assessed on the challenging CEC2021 test functions, demonstrating its high optimization performance, stability, and significance. The prediction model is then validated on two datasets, showing that the proposed prediction method achieves high accuracy and robust prediction stability.

construct the mid-long-term load prediction model, and the results show that the improved sparrow search algorithm-SVM has a better prediction performance. However, the authors did not improve the initial population or the update formula of the sparrow search algorithm, so the improved method cannot significantly improve the optimization performance of the sparrow search algorithm 34.
In summary, a methodology is introduced that combines PCA with the MISSA-SVM prediction model. To enhance prediction efficiency and reduce the influence of unimportant features on predictions, PCA is employed for feature dimensionality reduction. At the same time, to enhance the performance of the SVM model and address the impact of hyperparameters on model performance, a multi-strategy improved SSA, based on Population Initialization using the Tent map, Nonlinear Predator Presence Probability, Chaotic-based Dynamic Opposition-based Learning, and a Selection Strategy, is proposed for optimizing the SVM hyperparameters. It is noteworthy that the proposed model is not suitable for photovoltaic forecasting with large datasets.
The contributions made in this paper are as follows. First, PCA is used to analyze the input data and reduce the influence of principal components with low contribution rates on prediction accuracy. Second, to build the optimal model, this paper proposes a multi-strategy improved squirrel search algorithm (MISSA), which uses a Tent map to improve the diversity of the squirrel population. A nonlinear predator presence probability $P_{dp}$ is used to increase the optimization ability of the squirrel population in the later stage, and chaotic map-based dynamic opposition-based learning (OBL) and a selection strategy from differential evolution (DE) are used to improve the convergence speed and optimization accuracy of the SSA. Third, benchmark test functions are used to compare the optimization performance of the MISSA with that of four other typical metaheuristic algorithms. The results show that the MISSA has the best optimization performance. Finally, using the data of two typical months, the proposed model is compared with four other prediction models. The results show that the root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE) of the proposed model are superior, and the corresponding fitness curve is also superior, which supports the accuracy and effectiveness of the proposed prediction model.
The main structure of this paper is as follows: "Support vector machine" introduces the basic theory of SVM; "Improved squirrel search algorithm" presents the basic theory of SSA and its improvement strategies, and tests the optimization performance of MISSA; "Prediction model of PV power generation based on MISSA-SVM" establishes a photovoltaic power generation prediction model based on MISSA-SVM, and introduces the prediction process; "Prediction of PV power generation based on MISSA-SVM" uses two simulation examples for analysis; "Conclusion" concludes the entire article and provides future prospects.

Support vector machine
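The basic principle of an SVM is to map the sample data to a high-dimensional space through a nonlinear mapping and then carry out linear regression in that space 35. The basic model is as follows:
$$f(x) = \omega^{T}\phi(x) + b \tag{1}$$
where $\omega$ and $b$ are the weight vector and the bias, and $\phi(x)$ is the nonlinear mapping. When performing regression analysis, the SVM allows for a deviation, $\varepsilon$, between $f(x)$ and $y$:
$$\min_{\omega, b} \; \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m} l_{\varepsilon}\big(f(x_i) - y_i\big) \tag{2}$$
where $C$ is the regularization constant, $m$ is the number of samples, and $l_{\varepsilon}$ is the $\varepsilon$-insensitive loss function. Introducing the relaxation variables $\xi_i$ and $\hat{\xi}_i$ and the Lagrange multipliers gives:
$$L = \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m}(\xi_i + \hat{\xi}_i) - \sum_{i=1}^{m}\mu_i\xi_i - \sum_{i=1}^{m}\hat{\mu}_i\hat{\xi}_i + \sum_{i=1}^{m}\alpha_i\big(f(x_i) - y_i - \varepsilon - \xi_i\big) + \sum_{i=1}^{m}\hat{\alpha}_i\big(y_i - f(x_i) - \varepsilon - \hat{\xi}_i\big) \tag{3}$$
where the Lagrange multipliers satisfy $\alpha_i, \hat{\alpha}_i, \mu_i, \hat{\mu}_i \ge 0$. Setting the partial derivatives of the Lagrange function with respect to $\omega$, $b$, $\xi_i$, and $\hat{\xi}_i$ to zero transforms the original problem into a dual problem that satisfies the KKT conditions. Substituting the result into Eq. (1), the regression function of the SVM is:
$$f(x) = \sum_{i=1}^{m}(\hat{\alpha}_i - \alpha_i)K(x, x_i) + b \tag{4}$$
where $K(x_i, x_j)$ is the kernel function. The radial basis function has only one parameter, excellent generalization ability, and good performance in processing nonlinear data:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right) \tag{5}$$
where $\sigma$ is the kernel function parameter. To find the optimal hyperparameters ($\sigma$ and $C$) of the SVM, MISSA is proposed in this paper.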
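The roles of the two hyperparameters can be illustrated with a minimal Python sketch. It assumes the common Gaussian form of the radial basis function with a 2σ² denominator and the standard ε-insensitive loss; the function names, the toy support vectors, and the coefficient values are hypothetical and are not taken from the paper.

```python
import math

def rbf_kernel(xi, xj, sigma):
    """Gaussian RBF kernel, assuming the form exp(-||xi - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def eps_insensitive_loss(residual, eps):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(residual) - eps)

def svr_predict(x, support_vectors, dual_coefs, b, sigma):
    """Evaluate f(x) = sum_i coef_i * K(x, x_i) + b,
    where coef_i stands for the difference of the two Lagrange multipliers."""
    return b + sum(c * rbf_kernel(x, sv, sigma)
                   for sv, c in zip(support_vectors, dual_coefs))

# Toy values, chosen only for illustration.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=0.1, sigma=1.0)
```

A small σ makes the kernel decay quickly, so each support vector influences only nearby points, while C controls how strongly deviations outside the ε tube are penalized; this is why both values are handed to the optimizer.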

Improved squirrel search algorithm
SSA is an optimization algorithm based on squirrel foraging behavior, proposed by Mohit Jain et al. in 2017 36,37.
Because the SSA has already been described in previous studies, this paper does not repeat its basic theory. Instead, after analyzing its principle and optimization process, this paper proposes strategies to remedy the shortcomings of the SSA.

Improved method based on squirrel search algorithm
Compared with traditional optimization algorithms, the SSA has better optimization performance, but it still suffers from poor population diversity and a tendency to fall into local optima. Therefore, this section proposes four methods to improve the SSA.

Population initialization based on Tent map
In view of the poor diversity of the initial squirrel population, the Tent map is used to initialize the population. The Tent map has strong ergodicity and randomness 38. The specific formula is given in Eq. (6). Figure 1 shows the initial populations generated by the two population initialization methods. Compared with the original initialization method, the population generated by the Tent map has higher ergodicity and randomness, which improves the diversity of the initial squirrel population.
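A Tent-map initialization can be sketched as follows. This is only an illustration: the skewed Tent map with breakpoint a = 0.7, the way the chaotic value is scaled into the search bounds, and the example bounds for the two SVM hyperparameters are all assumptions, since the exact form of Eq. (6) is not reproduced in this text.

```python
import random

def tent_map(x, a=0.7):
    """One step of a skewed Tent map on [0, 1]; the breakpoint a is an assumed value."""
    return x / a if x < a else (1.0 - x) / (1.0 - a)

def tent_population(n_squirrels, lb, ub, a=0.7, seed=0):
    """Generate n_squirrels positions by iterating one chaotic sequence per dimension."""
    rng = random.Random(seed)
    dim = len(lb)
    # Random starting points for the chaotic sequences, kept away from 0 and 1.
    x = [rng.uniform(0.05, 0.95) for _ in range(dim)]
    population = []
    for _ in range(n_squirrels):
        x = [tent_map(v, a) for v in x]
        # Scale the chaotic value in [0, 1] into the search interval of each dimension.
        population.append([lb[j] + x[j] * (ub[j] - lb[j]) for j in range(dim)])
    return population

# Hypothetical bounds for the two hyperparameters (C, sigma).
population = tent_population(30, lb=[0.01, 0.001], ub=[1000.0, 10.0])
```

A skewed map is used here rather than the symmetric map with breakpoint 0.5, because the symmetric version collapses to zero after a few dozen iterations in binary floating-point arithmetic.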

Nonlinear predator presence probability
Addressing the defects of the predator presence probability $P_{dp}$ can give the squirrel population a chance to jump out of local optima and conduct global optimization again at a later stage. Based on the sigmoid function, this paper proposes a nonlinearly decaying probability formula to balance the optimization behavior of the squirrels and improve the optimization performance. The sigmoid function 39 is as follows, and its curve is shown in Fig. 2:
$$S(x) = \frac{1}{1 + e^{-x}} \tag{7}$$
The improved formula based on the sigmoid function is given in Eq. (8), and its curve is shown in Fig. 3. In Fig. 3, $P_{dp}$ decays nonlinearly, rapidly approaching 0 from 0.1; t is the current iteration number, and T is the maximum number of iterations. This method can effectively improve the foraging ability of the squirrel population and reduce the probability of random foraging.
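The qualitative behavior described above (a probability that starts at 0.1 and decays nonlinearly toward 0) can be sketched in Python. The decay formula below is a hypothetical stand-in built from the sigmoid function, not the paper's Eq. (8); the steepness k and the function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def nonlinear_pdp(t, T, p0=0.1, k=10.0):
    """Hypothetical sigmoid-based decay: equals p0 at t = 0 and falls quickly toward 0.

    t is the current iteration, T the maximum number of iterations,
    and k an assumed steepness parameter.
    """
    return 2.0 * p0 * (1.0 - sigmoid(k * t / T))

# Probability schedule over 50 iterations.
schedule = [nonlinear_pdp(t, 50) for t in range(51)]
```

With such a schedule, random relocation triggered by a predator is likely only in the early iterations, so the later iterations are spent almost entirely on directed foraging.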

Chaotic based dynamic opposition-based learning
To counter the tendency of the squirrel population to fall into local optima during iteration, this paper proposes a chaotic-based dynamic OBL that updates the squirrel positions after the regular position update. The basic OBL generates the opposite solution of a position from the bounds of the search space, where ub and lb are the upper and lower bounds of the population, respectively. On this basis, and considering the elite OBL 23, this paper proposes a chaotic-based dynamic OBL, in which dy(ub) and dy(lb) are the upper and lower dynamic limits, respectively, determined mainly by the maximum and minimum values of the squirrel population in its current iteration. In addition, a Tent map replaces the random number R, which increases the randomness of the position update. Figure 4 shows the random sequences generated by the two methods over 50 iterations.
As shown in Fig. 4, the sequence generated by the Tent map has higher randomness and ergodicity.
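One possible reading of the chaotic-based dynamic OBL is sketched below. It assumes the elite-OBL style update, in which the opposite position is a chaotic coefficient times the sum of the dynamic bounds minus the current position; the paper's exact formula is not reproduced in this text, so the update rule, the clamping to the static bounds, and all names are assumptions.

```python
def tent_map(x, a=0.7):
    """One step of a skewed Tent map on [0, 1]; the breakpoint a is an assumed value."""
    return x / a if x < a else (1.0 - x) / (1.0 - a)

def chaotic_dynamic_obl(population, lb, ub, chaos, a=0.7):
    """Return opposite positions and the updated chaotic state.

    The dynamic limits are the per-dimension minimum and maximum
    of the current population.
    """
    dim = len(lb)
    dy_lb = [min(ind[j] for ind in population) for j in range(dim)]
    dy_ub = [max(ind[j] for ind in population) for j in range(dim)]
    opposite = []
    for ind in population:
        new_position = []
        for j in range(dim):
            chaos = tent_map(chaos, a)  # chaotic value replaces the random number R
            value = chaos * (dy_lb[j] + dy_ub[j]) - ind[j]
            # Assumed repair step: clamp to the static search bounds.
            new_position.append(min(max(value, lb[j]), ub[j]))
        opposite.append(new_position)
    return opposite, chaos

# Toy population of three squirrels in two dimensions.
population = [[1.0, 2.0], [3.0, 0.5], [2.0, 4.0]]
opposite, state = chaotic_dynamic_obl(population, lb=[0.0, 0.0], ub=[5.0, 5.0], chaos=0.37)
```

Because the dynamic limits shrink as the population converges, the opposite solutions stay close to the region currently being searched instead of being reflected across the whole search space.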

Selection strategy
After the dynamic OBL, the selection strategy of DE is used to select the optimal location of the squirrel population.
The selection rule compares the two candidate solutions of each squirrel and keeps the one with the better fitness value, where FS_new is the solution generated by the original population update formula, FS_dyol is the solution generated by the dynamic OBL, and f(FS_new) and f(FS_dyol) are their fitness values. After the dynamic OBL has increased the optimization performance of the squirrel population, the selection strategy preserves the best individuals of the iteration.
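The selection step can be written as a few lines of Python. This is a minimal sketch assuming a minimization problem and a strict comparison; the function names and the sphere test function are introduced here only for illustration.

```python
def greedy_selection(fs_new, fs_dyol, fitness):
    """Keep, for each squirrel, whichever candidate has the lower fitness value."""
    selected = []
    for current, opposite in zip(fs_new, fs_dyol):
        # Assumed tie rule: the current solution is kept when the values are equal.
        selected.append(opposite if fitness(opposite) < fitness(current) else current)
    return selected

def sphere(x):
    """Simple test function with its minimum at the origin."""
    return sum(v * v for v in x)

fs_new = [[1.0, 1.0], [0.2, 0.1]]
fs_dyol = [[0.5, 0.5], [2.0, 2.0]]
survivors = greedy_selection(fs_new, fs_dyol, sphere)
```

The first squirrel adopts its opposite solution and the second keeps its current one, so the fitness of each individual never gets worse from applying the OBL step.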

Multi-strategy improved squirrel search algorithm summary
This paper applies the proposed methods to the SSA to obtain MISSA and improve its optimization performance.
1. Improved method (1), population initialization based on a Tent map: The original random initialization of SSA is replaced by Eq. (6), and the diversity of the squirrel population generated by the Tent initialization is increased.
2. Improved method (2), nonlinear predator presence probability: A $P_{dp}$ based on the sigmoid function, Eq. (8), is proposed. Compared with the traditional $P_{dp}$, the proposed formula decays nonlinearly and quickly approaches 0 from 0.1, which increases the optimization ability of the SSA in the middle and late stages.
3. Improved method (3), chaotic-based dynamic OBL: A chaotic dynamic OBL is proposed to obtain the reverse solution of the squirrel population before the end of the current iteration, which greatly improves the optimization performance of the SSA.
4. Improved method (4), selection strategy: A selection strategy based on DE is proposed to select the better solution from the reverse solution and the current solution, which maximizes the advantages of the population and further improves its optimization performance.
The optimization process of MISSA is given as pseudo code in Algorithm 1. To verify the optimization performance of MISSA, it is compared with four other metaheuristic algorithms. This paper evaluates the proposed algorithm on the challenging CEC2021 benchmark test functions and compares its results with those of GNDO, DBO, WSO, and SSA. The CEC2021 benchmark comprehensively tests optimization performance on the Basic Functions, Hybrid Functions, and Composition Functions sets. The parameter settings for all algorithms are provided in Table 1. Additionally, since the SVM hyperparameter optimization problem is a two-dimensional optimization problem, the dimensionality of all test functions is set to 2. Each optimization algorithm is run 30 times on each test function, and the results are statistically summarized using three metrics: best value, average value, and standard deviation. The CEC2021 functions are introduced in Table 2, and the average best fitness curves are illustrated in Fig. 5 and summarized in Table 3.
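The 30-run evaluation protocol can be sketched as follows. The example uses a plain random search on a two-dimensional sphere function as a stand-in optimizer and test function, because neither MISSA's pseudo code nor the CEC2021 function definitions are reproduced here; the bounds, the iteration count, and the use of the sample standard deviation are assumptions.

```python
import random
import statistics

def sphere(x):
    """Stand-in test function (not a CEC2021 function)."""
    return sum(v * v for v in x)

def random_search(fitness, lb, ub, iterations, rng):
    """Stand-in optimizer: returns the best fitness found by uniform sampling."""
    best = float("inf")
    for _ in range(iterations):
        candidate = [rng.uniform(low, high) for low, high in zip(lb, ub)]
        best = min(best, fitness(candidate))
    return best

def summarize_runs(fitness, lb, ub, runs=30, iterations=200, seed=0):
    """Repeat the optimizer and report the Best, Mean, and Std of the final values."""
    rng = random.Random(seed)
    results = [random_search(fitness, lb, ub, iterations, rng) for _ in range(runs)]
    return {
        "best": min(results),
        "mean": statistics.mean(results),
        "std": statistics.stdev(results),  # sample standard deviation (assumed)
    }

summary = summarize_runs(sphere, lb=[-100.0, -100.0], ub=[100.0, 100.0])
```

Reporting the mean and standard deviation alongside the best value separates an algorithm that is reliably good from one that only occasionally finds the optimum.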
The optimization performance of GNDO, DBO, WSO, SSA, and MISSA was tested and compared. On the Basic Functions, MISSA consistently found the minimum value of the functions with high efficiency. On the Hybrid Functions, several advanced algorithms tended to get trapped in local optima with lower precision, while MISSA demonstrated sustained performance and higher optimization accuracy. On the Composition Functions, MISSA maintained fast optimization efficiency while ensuring high precision in its solutions. These results demonstrate the advantages of the proposed optimization method. To further validate its superiority and stability, the evaluation metrics Best, Mean, and Std were used to summarize the results of the 30 tests, as shown in Table 3.
In Table 3, all optimal metrics are highlighted in bold font. On the Basic Functions, MISSA found the minimum value of the functions in all 30 runs, demonstrating stable optimization performance. On the Hybrid Functions, MISSA achieved the minimum value with a significantly better average and standard deviation than the other algorithms. On the Composition Functions, MISSA consistently found the minimum values for F8 and F10, while for F9, DBO and MISSA achieved the same optimal value, but MISSA exhibited extremely high stability in its solutions.
The above analysis compares MISSA with four other advanced optimization algorithms on the CEC2021 test functions and demonstrates the superior optimization performance of MISSA. Furthermore, to validate the significance of the results, we conducted Wilcoxon tests.
The Wilcoxon test is a nonparametric statistical test used to assess whether the difference between two sets of results is significant; its key output is the p-value. For example, the entry at row 2, column 2 of Table 4 is a p-value of 1.2118e−12, so p < 0.05, and MISSA therefore differs significantly from GNDO. Conversely, if p ≥ 0.05, the difference is not significant, as seen in row 5, column 4 of Table 4. Overall, it can be inferred that MISSA demonstrates strong significance.
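How such a p-value arises can be illustrated with a small pure-Python sketch. It assumes the two-sided Wilcoxon rank-sum variant with a normal approximation, tie correction, and continuity correction; the text does not state which Wilcoxon variant or software was used, so this choice and the sample data are assumptions.

```python
import math

def average_ranks(values):
    """Return ranks (ties get the average rank) and the tie term sum(t^3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term

def rank_sum_p_value(a, b):
    """Two-sided rank-sum p-value using the normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    ranks, tie_term = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0.0:
        return 1.0
    z = max(abs(w - mean_w) - 0.5, 0.0) / math.sqrt(var_w)  # continuity correction
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical final fitness values from 30 runs of two algorithms.
runs_a = [0.0] * 30                           # always reaches the optimum
runs_b = [1.0 + 0.01 * i for i in range(30)]  # always ends above it
p_clear = rank_sum_p_value(runs_a, runs_b)
p_similar = rank_sum_p_value([2.0 * i for i in range(30)], [2.0 * i + 1.0 for i in range(30)])
```

In this sketch, 30 identical optimal results compared against 30 strictly worse results give a p-value on the order of 1e−12, while two interleaved samples give a p-value well above 0.05.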

Prediction model of PV power generation based on MISSA-SVM
This section will specifically introduce the prediction model of PV power generation based on MISSA-SVM.

Data pre-processing
The data in this paper come from the power generation records of a 23.4 kW PV power station between 8 a.m. and 5 p.m. 33. For the effectiveness of the experiment, two typical data sets from the months of June and December were used. The number of data characteristics is shown in Table 5, and the data from June and December are shown in Tables 6 and 7, respectively.
The dataset is organized with sliding windows. Assuming the original data size is m × n and the length of the sliding window is l, each rectangular sliding window has size l × n. Typically, the sliding step size is set to 1, resulting in (m − l + 1) rectangular datasets after sliding window processing. This study involves two datasets, each spanning 30 days. Each day's data consists of 10 entries with 6 features, making the total data size 300 × 6. Based on experimentation, a sliding window length of 70, equivalent to 7 days, was found to yield the best prediction performance. Hence, the sliding window size obtained is 70 × 6, resulting in a final set of 24 rectangular datasets, each sized 20 × 6. To reduce the complexity of the data and eliminate the features with low correlations, PCA is used to process the two data sets. The specific results are shown in Table 8, where CR represents the principal component contribution rate and TCR represents the total contribution rate.
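The window count can be checked with a short Python sketch. The helper name and the placeholder data are hypothetical; the step of 10 entries (one day) is an assumption introduced here because it reproduces the count of 24 windows, whereas a step of 1 gives m − l + 1 = 231 windows.

```python
def sliding_windows(data, length, step=1):
    """Split a list of rows into overlapping windows of the given length."""
    return [data[start:start + length]
            for start in range(0, len(data) - length + 1, step)]

# Placeholder data set: 300 rows (30 days x 10 entries) with 6 features each.
rows = [[float(i)] * 6 for i in range(300)]

windows_step_1 = sliding_windows(rows, length=70, step=1)
windows_daily = sliding_windows(rows, length=70, step=10)

count_step_1 = len(windows_step_1)   # 231 windows of size 70 x 6
count_daily = len(windows_daily)     # 24 windows of size 70 x 6
```

Moving the window by one full day keeps every window aligned to day boundaries, so each window always contains seven complete days of measurements.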
According to the results in Table 8, and using a total contribution rate threshold of 95%, the first five principal components of both data sets are selected as the input of the model.
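The selection rule based on the total contribution rate can be sketched as follows. The eigenvalues in the example are illustrative placeholders, not the values behind Table 8, and the function names are hypothetical.

```python
def contribution_rates(eigenvalues):
    """Return the contribution rate (CR) and total contribution rate (TCR) lists."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cr = [value / total for value in ordered]
    tcr = []
    running = 0.0
    for rate in cr:
        running += rate
        tcr.append(running)
    return cr, tcr

def components_for_threshold(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose TCR reaches the threshold."""
    _, tcr = contribution_rates(eigenvalues)
    for count, value in enumerate(tcr, start=1):
        if value >= threshold:
            return count
    return len(tcr)

# Illustrative eigenvalues of a 6-feature covariance matrix (not from the paper).
example_eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]
cr, tcr = contribution_rates(example_eigenvalues)
n_components = components_for_threshold(example_eigenvalues, threshold=0.95)
```

With these placeholder values the first four components explain 90% of the variance, so a fifth component is needed to pass the 95% threshold.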
The training set and test set are then allocated. In the data set from June, 270 groups of data from the first 27 days are used as the training set, and 30 groups from the last three days are used as the test set. In the data set from December, 280 groups of data from the first 28 days are used as the training set, and 30 groups from the last three days are used as the test set.
To validate the effectiveness of PCA, we used the MISSA-SVM model to predict the sample data before and after feature extraction. The specific results are outlined in Table 9.
Table 9 gives the prediction results and prediction times for both datasets before and after feature extraction. Both the prediction accuracy and the prediction time improved after applying PCA compared with the original dataset, which demonstrates the effectiveness of PCA.

Data normalization
Because the features are complex and their numerical ranges differ greatly, all data are normalized to (−1, 1) before prediction to increase the prediction accuracy. After prediction, the predicted data are de-normalized so that the results are output in the original units.
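A common way to implement this step is linear min-max scaling, sketched below. This form is an assumption, since the normalization formula is not reproduced in this text; note that it maps the extreme values exactly onto −1 and 1. The function names and sample values are hypothetical.

```python
def fit_min_max(values):
    """Record the minimum and maximum of the training values."""
    return min(values), max(values)

def normalize(value, v_min, v_max):
    """Linearly map [v_min, v_max] onto [-1, 1]."""
    return 2.0 * (value - v_min) / (v_max - v_min) - 1.0

def denormalize(scaled, v_min, v_max):
    """Inverse mapping from [-1, 1] back to the original units."""
    return (scaled + 1.0) * (v_max - v_min) / 2.0 + v_min

# Hypothetical power readings in kW.
power = [2.0, 5.0, 11.0]
v_min, v_max = fit_min_max(power)
scaled = [normalize(v, v_min, v_max) for v in power]
restored = [denormalize(s, v_min, v_max) for s in scaled]
```

The minimum and maximum should be taken from the training data only and then reused for the test data and for de-normalizing the predictions, so that no information from the test set leaks into the model.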

Prediction process of PV power generation based on MISSA-SVM
This section elaborates on the specific process of PV power generation prediction based on MISSA-SVM, as shown in Fig. 6.
1. Data pre-processing: The two data sets are numbered, classified, and divided into a training set and a test set. The data from the final three days are selected as the test set, and the remaining data as the training set.
2. PCA: According to the standard of a 95% total contribution rate of the principal components, PCA is used to process the PV power generation data.
3. Parameter optimization: The hyperparameters C and σ of the SVM are optimized with MISSA; the specific optimization process is given as pseudo code in Algorithm 1.
4. Power generation forecast: The hyperparameters optimized by MISSA are substituted into the SVM to establish the prediction model, and the test set is input for prediction (in Example analysis 1: C = 913.934, σ = 0.0445; in Example analysis 2: C = 868.771, σ = 0.001).
5. Result output: The prediction of the PV power generation is completed, and the prediction results are output, including the prediction curve, residual curve, RMSE, MSE, MAE, and operation time.

Prediction of PV power generation based on MISSA-SVM
To verify the performance of the prediction model in this paper, the proposed model, the models from references 24,38,39, and some common models are used in this section to predict the power generation, and the prediction results are compared. The specific models are MISSA-SVM, GA-SVM, PSO-SVM, SSA-SVR, and GWO-BPNN. Three evaluation indicators are adopted, namely RMSE, MSE, and MAE, together with the R coefficient.
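The evaluation indicators can be computed with a few lines of Python. The sketch assumes the usual definitions of MSE, RMSE, and MAE and interprets the R coefficient as the Pearson correlation coefficient between the actual and predicted values; that interpretation, the function names, and the sample values are assumptions.

```python
import math

def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def pearson_r(actual, predicted):
    """Pearson correlation coefficient (assumed meaning of the R coefficient)."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical actual and predicted power values.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
scores = {
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "R": pearson_r(actual, predicted),
}
```

Because RMSE is the square root of MSE, the two indicators always rank models in the same order; MAE adds information because it is less sensitive to a few large errors.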

Example analysis 1
This section analyzes the prediction results of the power generation during the last three days of June, covering 30 time periods between 8 a.m. and 5 p.m. The number of iterations of all models is set to 50, and the population size of each algorithm is 30. Table 10 shows the three evaluation indicators of the prediction results, Fig. 7 shows the prediction results, Fig. 8 shows the prediction residuals, and Fig. 9 shows the fitness values. First, as can be seen from Table 10, compared with the other four models, the three evaluation indicators of the proposed model are all the smallest, with values of 1.536, 2.361, and 1.230, respectively. The three evaluation indicators of the other three SVM-based models are larger than those of the proposed model, although the prediction time of the proposed model is increased by improved methods (3) and (4). Because the PSO model is simple, the prediction time of the PSO-SVM is the shortest; however, its prediction results are worse. PSO-BPNN is also analyzed. Because of the complexity of the BPNN model and its dependence on large sample data, its prediction time is the longest and its three evaluation indicators are the worst of the five models.
Then, Figs. 7 and 8 are analyzed. The prediction curve of the proposed model fits the actual values best, and the prediction results of MISSA-SVM show no large prediction errors. The prediction results of the other four models are poorer, and their curve fitting lags behind that of the model proposed in this paper. The residual curve of the proposed model also fluctuates the least of the five. Furthermore, Fig. 9 shows that the fitness curve of the proposed model converges the fastest. Because of improved method (1), the initial value of the curve is the best, and the optimal fitness value is found around the seventh generation.

Finally, by comprehensively analyzing the above prediction results, it can be concluded that compared with the other four prediction models, the prediction performance of the model proposed in this paper has advantages.
It is worth noting that testing on a single data set cannot fully prove the effectiveness of the proposed model. Therefore, in "Example analysis 2", the data from December are predicted and analyzed to further demonstrate the effectiveness and superiority of the proposed model.

Example analysis 2
In this section, five models are used to predict the PV power generation during the last three days of December. The time tags of the data and the parameter settings of the models are the same as those in "Example analysis 1". Table 11 shows the three evaluation indicators of the prediction results, Fig. 10 shows the prediction results, Fig. 11 shows the prediction residuals, and Fig. 12 shows the fitness values. First, Table 11 is analyzed. Compared with the other four models, the three evaluation indicators of the proposed model are, as with the June results, all the smallest, with values of 0.813, 0.660, and 0.489, respectively. The three evaluation indicators of the other three SVM-based models are larger than those of the proposed model, although the prediction time of the proposed model is again increased. The outcome of this example is therefore the same as that of Example analysis 1, which further supports the conclusion that the prediction performance of the proposed model is higher than that of the other models.
Then, Figs. 10 and 11 are analyzed. Compared with the actual results, the prediction results of MISSA-SVM show a large prediction error only in the afternoon of the first day, and the prediction curve fits well in the other periods. The prediction results of the other four models are poorer, and their curve fitting lags behind that of the model proposed in this paper.
As shown in Fig. 12, the fitness curve of the proposed model begins to converge in the seventh generation, and the fitness value it finds is superior to those of the other four models. The SSA-SVM also finds a comparatively good fitness value, but its curve begins to converge only in the 17th generation.
Finally, according to Table 11, the three evaluation indicators of the SSA-SVM have certain advantages, but its prediction also takes a long time because of the complexity of the optimization strategy of the SSA.
At the same time, among the SVM-based models, the prediction time of the proposed model is further increased by improved methods (3) and (4), but the three evaluation indicators it obtains have significant advantages. Overall, MISSA-SVM generates a more accurate prediction at the expense of a small increase in prediction time.

Conclusion
In order to improve the accuracy of the power generation prediction of PV power stations, a prediction model based on SVR is proposed. First, PCA is used to analyze the power generation data and to remove the principal components with low correlation. Second, four improved methods are proposed to improve the SSA, based on Tent chaos initialization, nonlinear attenuation of the predator probability, a chaos-based dynamic reverse learning strategy, and a selection strategy. Finally, the PV power generation prediction model based on the MISSA-SVR is established. The conclusions of this paper are as follows:
1. Six benchmark functions were used to test the optimization performance of the MISSA and four other metaheuristic algorithms. The test results show that the MISSA has the fastest optimization speed and the highest precision, which supports the accuracy of the MISSA in SVR hyperparameter optimization.
2. In the example analyses on two data sets, the MISSA-SVR greatly improved prediction accuracy compared with the other prediction models, at the expense of some prediction efficiency. The three evaluation indicators obtained by simulation are optimal, which demonstrates the effectiveness of the prediction model proposed in this paper.
To sum up, the multi-strategy improvement method proposed in this paper has a high reference value. At the same time, the proposed prediction model has a relatively excellent prediction effect, and it can also be applied to most prediction problems in related fields.
The methods proposed are only applicable to small-sample, low-dimensional photovoltaic power generation data.Future research will focus on validating these methods using online prediction techniques based on deep learning.Additionally, efforts will be made to develop prediction models suitable for various types of data by integrating the proposed methods.

Figure 6. Prediction process based on PCA and MISSA-SVM.

Figure 7. Prediction results based on June data.

Figure 8. Residual curve based on June data.

Figure 9. Fitness curve based on June data.

Figure 10. Prediction results based on December data.

Figure 11. Residual curve based on December data.

Figure 12. Fitness curve based on December data.

Table 1. Parameters of algorithms.

Table 3. Test results of CEC2021. Significant values are in bold.

Table 4. Test results of Wilcoxon.

Table 5. Number of data characteristics.

Table 6. Data from June.

Table 7. Data from December.

Table 8. Principal component analysis.
CR-Jun TCR-Jun CR-Dec TCR-Dec

Table 9. Comparison of prediction results based on PCA. Significant values are in bold. Data pre-processing: Number and classify the two data sets and divide them into a training set and a test set.

Table 10. Prediction results based on June data. Significant values are in bold.

Table 11. Prediction results based on December data. Significant values are in bold.